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Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting 
human disease. But after an initial genome scan has identified a "hit region" of association, single-locus approaches can falter. 
Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous. 
Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs, 
with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability. 
Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint 
model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising 
but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general 
method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong 
LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage 
with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included 
in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control 
genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies 
a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of 
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INTRODUCTION 

Single locus regression has become a staple tool of hu- 
man genome-wide association studies (GWAS; WTCCC 
[2007]). Despite the fact that it simplistically reduces the 
often complex genetic architecture of a phenotype down 
to effects at an individual single nucleotide polymor- 
phism (SNP) (or other localized variant), it has proved 
powerful in identifying major genetic determinants and 
predictors of disease susceptibility [Cantor et al, 2010]. 
Many would acknowledge that simultaneous modeling 
of all loci potentially yields fairer estimates of genetic 
effect, more stable phenotypic predictions, and better 
characterization of between-locus confounding [Hoggart 
et al., 2008; Lee et al., 2008]. However, such multiple lo- 
cus approaches are at present seldom used. This could be 
because they are considered impractical, potentially hard 
for readers to understand, or, with some theoretical sup- 
port [Fan and Lv, 2008], unnecessary in an initial genome 
scan. Certainly, much of the genome-wide confounding that 
explicit multiple locus modeling would hope to resolve is 



efficiently, if bluntly, dealt with by the addition of regres- 
sion covariates correcting for higher order geometric rela- 
tionships in the data [Price et al, 2010] or probabilistically 
inferred strata [Pritchard et al., 2000]. 

Nonetheless, once initial genome scans have been per- 
formed and "hit regions" of association identified, short- 
comings of a single-locus approach become apparent. Local 
patterns of linkage disequilibrium (LD) in such hit regions 
can make ambiguous both the number of underlying true 
signals and the identity of the loci that most directly give 
rise to them [e.g., Strange et al., 2010]. Statistical analysis 
after this point is often ad hoc. It typically involves fitting 
further regressions that condition on "top" loci that appear 
most strongly associated in order to rule out neighbors or 
rule in suspicions of an independent second signal [Barratt 
et al., 2004; Udea et al, 2003]. This is followed by more in- 
terpretive analysis based on annotation as a prelude to, for 
example, investigation at the bench. In ad hoc condition- 
ing, rarely is there formal consideration of the fact that the 
association of the top locus is often insignificantly different 
from that of its correlated neighbors, and that whereas its 
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association with the phenotype is probably stable to sam- 
pling error, its superiority in association over its neighbors 
is probably not. This inherent instability of the relative 
strengths of association between confounding loci makes 
such strategies high risk: a slightly different sampling of in- 
dividuals could demote the conditioning locus, result in an 
alternative conditioning locus being chosen, and potentially 
lead to altered conclusions. This approach becomes yet more 
unstable when some of the loci are themselves known with 
varying certainty, their genotypes having been partially or 
wholly imputed [Zheng et al., 2011], such that weakness 
of association is now also a function of imputation uncer- 
tainty unrelated to the phenotype [e.g., Servin and Stephens, 
2007]. 

There is thus great value in developing principled ap- 
proaches to discriminate true from false signals in hit re- 
gions. Joint modeling of all loci through multiple regression 
seems attractive because it accounts for the LD of the data 
[Balding, 2006]. Standard regression is unsuitable for this 
purpose, however, because even when the number of con- 
sidered loci p is much fewer than the number of individ- 
uals n, LD creates multicollinearity that derails meaning- 
ful estimation of locus effects. Stepwise multiple regression 
techniques [Cordell and Clayton, 2002] formalize the ad 
hoc conditioning approach but also inherit its weaknesses: 
model selection choosing a single set of active loci typically 
provides no indication about how sensitive that choice was 
to, for example, sampling variability, making it a statistic 
that is opaque at best and misleading at worst. Bayesian 
approaches offer a coherent perspective by formally ac- 
counting for uncertainty in model choice, effect estimation, 
and imputation uncertainty [Stephens and Balding, 2009]. 
Nonetheless, these are often highly computationally inten- 
sive, and require formal statements of prior belief relating to 
the number of causal variants and their effects that analysts 
may feel unprepared or unwilling to specify. 

Penalized regression models can provide an alternative 
that does not require a commitment to Bayesian learning. 
Placing a penalty on the size of coefficients in the multiple re- 
gression leads to moderated estimates of coefficient effects, 
allowing their stable estimation even when many predic- 
tors are in the model. In particular, the LASSO [Tibshirani, 
1996], which penalizes increases in the absolute value of 
each coefficient subject to a penalty parameter X, results in 
some effects being shrunk to exactly zero. The result is a 
"sparse" model in which only a subset of effects are ac- 
tive. Increasing the level of penalization leads to greater 
sparsity effectively making X a continuous model selection 
parameter. Recent advances in fitting LASSO-type models 
have made them more practical for analysis of large-scale 
genetic data [e.g., Wu et al., 2009]. Nonetheless, as a tool for 
modeling effects at multiple loci, the LASSO leaves impor- 
tant questions unanswered. One problem is how to select A.. 
This is typically approached through criteria-based evalua- 
tion methods [Wu et al., 2009; Zhou et al, 2010], such as AIC 
and BIC, empirical measures of predictive accuracy (such 
as cross-validation [Friedman et al., 2010]), or criteria aim- 
ing to control type I error (such as permutation [Ayers and 
Cordell, 2010]). Another problem is, given X, how to charac- 
terize uncertainty in model choice. Although LASSO mod- 
erates estimated effects through shrinkage, it is no better 
than stepwise methods in that it ultimately selects a single 
model (or single "path" of models, when X is varied), and 
thus states with absolute confidence a statistic that could in 
fact be highly sensitive to the sampling of observations. 



An intuitive way to characterize variability of model 
choice is to estimate for each locus a model inclusion prob- 
ability (MIP). A Bayesian approach would formulate this as 
a posterior probability that conditions on both the observed 
data and prior uncertainty in model choice. The Bayesian 
MIP embodies a statement about whether the researcher 
should believe the locus is included in the true model. A 
frequentist alternative is to formulate the MIP as the prob- 
ability a locus would be included in a sparse model under 
an alternative realization of the data. This frequentist MIP 
is thus a statement about the expected long-run behavior 
of the model selection procedure. Valdar et al. [2009] pro- 
posed an approach that applied forward selection of genetic 
loci to resamples of the data and defined the resample MIP 
(RMIP) as the proportion of resampled datasets for which a 
locus was selected. This resample model averaging (RMA) 
approach used either bootstrapping (i.e., "bagging") or sub- 
sampling (i.e., "subagging"), and followed an earlier appli- 
cation to genome-wide association in Valdar et al. [2006] and 
work on general aggregation methods by Breiman [1996] 
and Biihlmann and Yu [2001] (cf. parallel applied work 
by Austin and Tu [2004] and Hoh et al. [2000]). Indepen- 
dently, Meinshausen and Biihlmann [2010] proposed "Sta- 
bility Selection" (SS) that powerfully combines subagging 
with LASSO shrinkage to produce a set of frequentist MIPs 
at each specified X. Recently, Alexander and Lange [2011] 
adapted this method with limited success to whole-genome 
association. 

Herein, we propose a statistical method for reprioritizing 
genetic associations in a hit region of a human GWAS based 
on case-control data that exploits and extends the resam- 
ple aggregation techniques developed in Valdar et al. [2009] 
and Meinshausen and Biihlmann [2010]. We demonstrate a 
principled approach, LASSO local automatic regularization 
resample model averaging (LLARRMA), that characterizes 
sensitivity of locus choice to sampling variability and un- 
certainty due to missing genotype data, and that provides 
LASSO shrinkage automatically regularized through either 
predictive- or discovery-based criteria. We show that when 
multiple correlated SNPs are present in a hit region that has 
been identified by standard single-locus regression, LLAR- 
RMA produces a reprioritization that is enriched for true 
signals. 

METHODS 

We start by considering a standard logistic regression to 
estimate the effects of m SNPs in a hit region on a case- 
control outcome in n individuals, and then describe statis- 
tical approaches to identify a subset m q of SNPs that rep- 
resent true signals. Herein, we define a "true signal" as the 
SNP that most strongly tags an underlying causal variant, 
a "background" SNP as an SNP that is not a true signal, 
and an optimal analysis as one that distinguishes true sig- 
nals from background SNPs in the hit region. We assume 
that the hit region has been previously identified by an ini- 
tial genome-wide screen using, for example, single-locus 
regression, that many of the m SNPs may be in high LD, 
and that m q < m < n. Let y = (yi, . . ., y n ) be an w-vector of 
the dichotomous response with each of the «i cases coded 
by 1 and the n 0 controls coded by 0, let X be an n x m matrix 
of SNP genotypes, where SNPs are coded to reflect additive- 
only effects as (0, 1, 2} for unphased genotypes {qq,qQ,QQ}, 
and let V — [y, X) and A/" = {1 n}. Logistic regression 
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models the case-control status of individual i as if sampled 
from Yi ~ Bin(p, , 1), where i's propensity p, — P(Yj — 1) is 
determined by a linear function of the m SNP predictors 



(v m 
Pt j - =1 



(1) 



where x,y is the value of the ;th SNP for the 2 th individual 
and the i/fh element of the column-centered design matrix 

X, |x is the intercept, and p = (Pi p,„) are the effects of 

the m predictors. 

We assume that only a subset of the m SNPs are true sig- 
nals, and define a corresponding vector of 0-1 inclusions 
7 = (71, . . . , 7,,,) such that yj = ^ 0). A common way 
to infer 7, and to thereby estimate the identity of the true 
signal, is to use a model selection procedure that maximizes 
some criterion of fit. This returns a binary vector 7, a hard 
estimate declaring which SNPs belong to the model. Al- 
though superficially attractive, 7 has limited interpretabil- 
ity because it provides no information about how sensitive 
the selection could have been to finite sampling. That is, 
whether 7 would be expected to vary dramatically when 
applied to alternative samples from the same population. 
Moreover, although many selection procedures guarantee 
that they will deliver the correct result in an infinite sample 
(i.e., are consistent), this offers little reassurance when the 
sample is finite, and suggests that the returned statistic 7 
could have high variance. 



subsample k as 



P(X; 2>«) = argmin „ j - <(P; V®) + X £ I P, I j 



(3) 



where l($;V (k) ) is the log-likelihood of P for data V (k) , and 
X is a penalty parameter. The LASSO estimate P(X; eas- 
ily translates into an estimate of the inclusions y(X;T>^ k ' 1 ) = 
I(P(K;T> {k ^) 0). Nonetheless, to arrive at a single estimate 
of 7, as required for model averaging, we must devise a 
suitable criterion for choosing the penalty X. We propose 
two alternatives, both of which identify a value specific 
to subsample k (i.e., local): complement deviance selection 
and permutation selection. 

Predictive-Based Choice of \ (fc) : Complement 
Deviance Selection. The complement deviance crite- 
rion seeks a model that would perform well in out-of- 
sample prediction. After estimating ^(K;T>^) over a grid 
of X to calculate the LASSO path, this criterion finds the 
value of X that minimizes the deviance of the complement 
of subsample k, i.e., 



CompDev 



; argmin x 



-2^)[ W log(ft.0 + (l 



y,)log(l-p,x)] 



LLARRMA 

Resample Model Averaging. We seek to estimate 
7 in a way that incorporates uncertainty in model choice 
arising through, for example, potential variability of the se- 
lected set due to finite sampling. To do this we use RMA 
[Valdar et al, 2009], applying a model selection procedure 
to repeated resamples of the data, and basing subsequent 
inference on the aggregate of those results. Rather than ob- 
taining a binary estimate of each yj, we instead seek to 
estimate its expectation E(7;) over resamples, hoping to ap- 
proximate its expectation over samples from the popula- 
tion. We start by drawing subsamples k — 1, . . . , K with 
subsampling proportion 4> = §, such that each subsample 
comprises data 2? (fc) = {y ( *\ X (f:) } on |A/" (,c) | = 4>n individuals 
N*® C TV. Each subsample is produced by drawing fyrii in- 
dividuals at random without replacement from the n% cases, 
and 4>« 0 individuals at random without replacement from 
the n 0 controls. For each subsample k, we perform a fixed 
model selection procedure to estimate y(D^) — y^, the m- 
length binary vector of inclusions based on the fcth subsam- 
ple. Applying this to all subsamples gives the K x m matrix 
r, where T T = [7®, 7 (2 \ ■ • • , 7 (K ']. The expected proportion 
of times that the ;th predictor is included in the model is 
given by its RMA estimate 



(2) 



which we refer to as its RMIP 
Selection Within a Subsample Using the Lasso. 

To select SNPs within the kih subsample, we use LASSO 
penalized regression [Tibshirani, 1996]. This estimates P for 



where A/" (w = A/"\A/" (,:) is the set of (1 — §)n individuals not 
selected for subsample k, and p,^ is the predicted probabil- 
ity of P(Yj — 1) based upon P(X; V^) applied to the design 
matrix of the complement subsample X ft,c) . 

Discovery-Based Choice of \ <fc) : Permutation 
Selection. The permutation selection criterion is a mod- 
ified version of that proposed by Ayers and Cordell [2010] 
and seeks a conservative model that would tend to include 
no SNPs under permutation of the response. Given a sub- 
sample k, we estimate for a given permutation of the re- 
sponse ir(y) the smallest penalty required to zero out all 
predictors, i.e., 



Xnuii(iT, k) = — ^-maxy \ {xf , 7r(y w )> 



I AW I 



where is the jth column of the subsampled and mean- 
centered design matrix X (,c) , and ( • , • ) denotes the inner prod- 
uct of its two arguments. Calculating this for each of S per- 
mutations wi, us, we estimate the permutation selec- 
tion X for subsample k as 

x Perm = median({X nuU (ir 1 , k), \ nu u(n 2 , fe) KxM^s, k)}) ■ 

(4) 

Ayers and Cordell [2010] apply a similar criterion when 
analyzing complete datasets, with the difference that they 
estimate X nu n as the maximum of {X nuU (n 1 ), . . . , X nu ii(iT S )} 
for S — 25. We prefer not to do this because the maxi- 
mum is relatively unstable for S = 25, and is undesirable for 
larger S because it potentially allows X nu u = X nu u(n s ) where 
tt s (y) = y. In contrast, when using the median (Equation 4) 
the accuracy of Xp erm increases with S, although we find that 
in simulations S = 20 is adequate. 
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Fig. 1. A comparison of LLARRMA and Stability Selection. 



Incorporating Uncertainty Due to Missing Geno- 
types: Hard, Dosage, and Multiple Imputation. 

SNP data within a hit region will often include combina- 
tions of markers and individuals for which the genotype 
is unknown or uncertain. To avoid a potentially wasteful 
complete cases analysis, it is common to impute the missing 
genotypes using a program such as MACH [Li et al, 2010], 
IMPUTE [Howie et al, 2009], or fastPHASE [Scheet and 
Stephens, 2006], and analyze the partly imputed data 
as if it were fully observed. Imputation methods are 
typically based on reconstruction and phasing of inferred 
haplotypes. Dividing the SNP matrix X into missing 
and observed elements X = {X mis , X obs ), methods such as 
fastPHASE [Scheet and Stephens, 2006] model the joint 
distribution p(X mis \ X obs , w), where w includes additional 
information used in the imputation (e.g., priors). Most 
GWAS, however, do not use this joint distribution directly. 
Rather, they replace X^ with a point estimate X mis , each 
element of which is constructed from its marginal distribu- 
tions. Specifically, X mis is replaced by either the "dosage," 
■^miT' with elements defined as the expectation of the allele 
count Xjj — E(x it \X ohs , <o); or a "hard" imputation, X^ A , 
with elements imputed as their maximum a posteriori 
genotype 

Xij = argmax ?£(012] p(x, 7 = g\X ohs , to). 

The simplest approach to modeling missing genotypes 
within LLARRMA is first to estimate X mis as either X^ e 
or X]™* d and then subsample X = {X mis , X ohs } as if it were 
complete. This plug-in approach underestimates variabil- 
ity because it fails to incorporate uncertainty about the 
imputation. Zheng et al. [2011] show that doing this when 
modeling effects at single loci reduces power by a negligible 
amount when the imputation accuracy is reasonably high. 
Nonetheless, ignoring imputation uncertainty could be 
more problematic in multiple-locus settings, if, for example, 
the posterior distribution of haplotypes p(X mis \X obs , w) 
differs substantially from joint distribution implied by the 
product of marginal posteriors n^e.*^, p( x ij\X 0 bs, w) [e.g., 
Servin and Stephens, 2007]. A natural way to incorporate 
imputation uncertainty into our resampling framework is 
through multiple imputation [Little and Rubin, 2002]. At 
each iteration k, we sample a new X^ is from its posterior 
p(Xwis\Xd», *»)/ subsample the resulting X* = [X^, X obs ] 
to give {X*, y} {k) = V* (k) , and then calculate RMIPs using 
•y(XH*>) in place of y(D^) in Equation 2. The resulting 



RMIPs incorporate additional variability because each 
subsample now includes a potentially different imputa- 
tion of missing genotypes. We implement hard, dosage, 
and multiple imputation using posterior draws from 
fastPHASE (making use of the -s option). 

COMPETING METHODS 

LLARRMA calculates a score (an RMIP) for each SNP in 
a case-control study. We compare the ability of those scores 
to discriminate true signals from background with the SNP 
scores calculated by two alternatives: the traditional GWAS 
approach of single-locus regression, and the LASSO-based 
subsample model averaging method stability selection 
(SS) recently proposed in a more general context by 
Meinshausen and Buhlmann [2010]. 

Single Locus Regression. We perform single locus 
regression with logistic regression as used in, for exam- 
ple, PLINK [Purcell et al, 2007]. For each SNP, we fit a 
single-predictor version of Equation 1 and score its — log 10 P 
(logP), where P is the P-value from a likelihood ratio test 
against an intercept-only model. 

Stability Selection. SS differs from LLARRMA in 
two main respects (see Fig. 1). First, whereas LLARRMA 
selects variables within each subsample using a local (i.e., 
subsample-specific) penalty X*', SS uses a single global 
penalty X applied to all K subsamples. Second, whereas 
LLARRMA chooses each X (fc) automatically, SS leaves its 
global X as a free parameter. In SS, the RMIP (referred to as 
the "selection probability" in Meinshausen and Buhlmann 
[2010]) is thus left as a function of X, 

. i K 

RMIP SS (X), = -£ Z(P(\;© W ); # 0) (5) 

fo=i 

giving rise to a sequence of RMIPs (a "stability path") for 
each locus j . Meinshausen and Buhlmann [2010] provide 
little guidance for choosing X. As a choice of X is required 
to produce a unique RMIP and thereby ensure meaningful 
comparison with LLARRMA, we select X to produce the 
stiffest possible competition: as the value that maximizes 
the criterion used for comparing methods. Specifically, 
given a criterion of success u(y, y) comparing truth 7 with 
guess y, we define 

Xpracie = argmax x u(y, RMIP SS (X)) , 
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where "oracle" reflects the fact that choosing this unfairly 
advantageous value requires foreknowledge of 7. We 
consider SS with the oracle property defined by setting u to 
be the initial area under the curve (AUC) (described below). 

ROC-BASED EVALUATION 

We assess the performance of LLARRMA and its com- 
petitors by simulation, examining the ability of each to dis- 
criminate true signals from background in simulated case- 
control studies. Performance is evaluated formally using 
receiver operator characteristic (ROC) curves. ROC curve 
methodology can vary between studies [Krzanowski and 
Hand, 2009], so we describe ours in full. A given simulation 
study comprises a set of simulation trials 5 = {1, . . . , S}. In 
each trial s, a given method is presented with m SNPs of 
which m q will be a true signal. That method calculates a 
single score for each SNP (an RMIP or logP). For a given 
threshold t, define power, (f) as the proportion of m q true 
signal SNPs scoring > t (i.e., the power to detect), and the 
false-positive rate FPR s (f) as the proportion of the m — m q 
background SNPs scoring > t (i.e., the false positive rate; 
FPR). We define the area under curve in trial s for FPRs 
between a and b as AUC, (a , b) — f power, (FPR^ 1 (x)) dx, 
where FPR7 1 (x) returns the threshold t at which the FPR 
is x, and the integration is approximated using the trapez- 
ium rule. For a given method and set of simulations S, 
we d efine the estimated AUC between FPR a and b as 
AUC(fl, b) — XLes S _1 AUC(s, a, b), and assume this esti- 
mate to be approximately normally distributed with vari- 
ance (S - l)- 1 £ se5 (AUC> , b) - AUC(s, a , b)\ . We define 
the "initial ROC" as the ROC curve in the range FPR e 
[0, 0.05], and the "initial AUC" as AUC(0, 0.05); the "full 
ROC" is where FPR e [0, 1] and the "full AUC" is AUC(0, 1). 
When plotting ROC curves for each method, we use thresh- 
old averaging [Fawcett, 2006], varying t over its range ([0, 1] 
for RMIPs; [0, 00) for logP) and at each t plotting x and y co- 
ordinates S" 1 ^sesFPRsW an d £] s& sPOwer s (f), respec- 
tively. 

SIMULATION STUDY 1: FIVE LOCI IN CANCER 
DATA 

We obtained genotype data from phase 1 of a case-control 
GWAS for colorectal cancer from collaborators at the Well- 
come Trust Centre for Human Genetics, University of Ox- 
ford. Two forms of the data are used here. The "cancer data" 
comprise complete genotype information on 1,493 subjects 
for 183 SNPs covering a hit region previously identified on 
18q21. The cancer data are a subset of the "full cancer data," 
which comprises incomplete genotype information on 1,859 
subjects for the hit region. 

Generating Missing Genotypes. To assess the sen- 
sitivity of the compared methods to alternative strategies 
for modeling missing genotypes, we generate incomplete 
versions of the cancer data by deleting genotypes accord- 
ing to a random missingness algorithm. The missingness 
algorithm is based on empirical modeling of the pattern of 
missing data in the full cancer data. The full cancer data 
genotypes contained 854 missing genotypes (~0.25%). We 
observed that the proportion of missing genotypes varied 
considerably from SNP to SNP, but that missingness across 
individuals was consistent with a random allocation. To 



generate each incomplete dataset, we therefore do the fol- 
lowing. First, for each SNP j, we assign a missingness pro- 
portion ipmis,; generated as a random draw ipous,; ~ /mis, 
where f mis is an empirical density based on the histogram 
of missingness proportions of SNPs in the full cancer data. 
Second, we select a subset of < n individuals eligi- 
ble to receive missing genotypes. Third, at each SNP /' 
we delete d, = n mis x min(c(|j mis 1) marker genotypes at 
random from the n mis individuals, where c is chosen such 
that the overall proportion of missing data is fixed value 
p mis — (mn)' 1 £ ■ dj. To generate a more conservative level 
of missingness while ensuring at least 10% of individuals 
had complete data, we set p mis = 0.1 and w m i S = 0.9n. 

Simulating Phenotypes. Phenotypes are simulated 
based on a binomial draw from the logistic model in Equa- 
tion 1. Given a set of SNPs representing true signals, with 
genotypes X, ; and effects p,,, we first calculate the inter- 
cept necessary for an expected 50/50 ratio of cases to con- 
trols as |x = — n _1 l T (XjP 9 ), calculate individual propensi- 
ties as pi = logit _1 (|jL + x^P 9 ), and then draw phenotypes 
as Y) ~ Bin(l, pi). 

Placing Causal Loci. To ensure a degree of confound- 
ing correlation between loci, we choose five true signal SNPs 
at random but in a restricted manner from the LD blocks 
shown in Figure 2. Specifically, in each simulation trial, two 
SNPs are chosen from block 1 at random but subject to cor- 
relation r > 0.4, two SNPs are from block 2, also subject to 
r > 0.4, and one SNP is randomly chosen from block 3. 

Simulation 1A: Moderate Effects. To aid an ini- 
tial illustrative comparison between methods, our first 
study on the cancer data simulates a relatively constant 
effects structure. In each simulation trial, we assign a 
permutation of the effects (on the odds scale) exp(P,} = 
(1.287, 1.398, 1.246, 1.357, 1.419) to the selected five SNPs. 

Simulation IB: Small Effects. Providing a more 
challenging and variable set of causal targets, our sec- 
ond study on the cancer data randomly chooses true sig- 
nal SNPs as in 1A but draws each element (3, ;i of effects 
P, independently as exp{(3, (/ ! ~ N(1.25(-l) v ' , 0.02 2 ) with 
Vj ~ Bin(l, 0.5). The resulting effects are comparable to the 
small effects estimated in many GWAS [Manolio et al., 2009] . 

SIMULATION STUDY 2: ONE TO SEVEN LOCI 
IN '58 DATA 

The '"58" data are a complete-genotypes subset of data 
collected during the human GWAS for seven diseases de- 
scribed in WTCCC [2007] . It comprises genotypes for 2,199 
subjects on 500 SNPs in the region 39.063723-10.985321 Mb 
on chromosome 22, this region being chosen by us as a con- 
tiguous run of markers that exhibits a mixture of high and 
low LD (Fig. 2). To assess the how the number of true sig- 
nals affects the relative utility of modeling single vs. multi- 
ple loci, we evaluated methods in seven distinct simulation 
substudies, simulating 1, . . . , 7 true signals, respectively. In 
each simulated trial of each substudy the set of true signals 
is chosen at entirely random from the 500 SNPs and the SNP 
effects are generated as in simulation IB above. 

COMPUTATION 

Genotype imputation was performed using fastPHASE 
[Scheet and Stephens, 2006]. All other analyses were per- 
formed in R [R Development Core Team, 2011], with the 
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Fig. 2. LD structure of the two genotype datasets used in the simulations. Shading indicates pairwise LD between SNPs, ranging from 
r 2 = 0 (white) to r 2 = 1 (black). Red highlighting shows blocks where true signals were placed in simulation studies 1A and IB. 



glmnet package [Friedman et al v 2010] used for fitting 
LASSO models. On a 2.4 Ghz MacBook Pro with 4 Gb RAM, 
on average 100 subsamples on the cancer data take the fol- 
lowing times: LLARRMA with permutation selection, 39.8 
sec (SD = 2.6 sec); LLARRMA with complement deviance 
selection, 389.2 sec (SD = 64.6 sec); SS, 305.7 sec (SD = 48.5 
s). Use of multiple/hard/ dosage imputed data incurs negli- 
gible extra computation, assuming the imputation itself has 
been done in advance. 



RESULTS 

SIMULATION STUDY 1A: MODERATE LD, 
MODERATE EFFECTS 

We simulated 1,000 case-control datasets based on the 
cancer data (see Methods and Fig. 2). Each simulated dataset 
had approximately balanced cases and controls, with in- 
dividuals' outcomes influenced by five SNPs of moder- 
ate effect (odds ratios 1.246-1.419) out of 183 SNPs in to- 
tal, and existed in both a complete form, referred to as 
the "complete" dataset, and an incomplete form, in which 
some genotype values were set to be missing. The incom- 
plete form was available in three alternative imputations: a 
"hard" imputation, a "dosage" imputation, and an ensem- 
ble of 100 sampled imputations that constituted a single 
"multiple" imputation set (these imputations being gener- 
ated by fastPHASE [Scheet and Stephens, 2006]). At each 
simulation, we tested four different analysis methods that 
each produced a score per SNP Our subsequent compar- 
isons of those methods were based on how well their scores 
discriminated the five SNPs that represented true signals 
from the 178 that did not. The four methods examined 
were (short names in parentheses): single SNP logistic regres- 
sion (single-locus regression); LLARRMA using permutation 
selection (permutation selection); LLARRMA using comple- 
ment deviance selection (complement deviance); SS using or- 
acle penalization (oracle SS). All methods were applied to 
the complete, hard imputation, and dosage imputation ver- 
sions of each simulated dataset; resample-based methods 
(i.e., all except single-locus regression), which were set to 
use K = 100 subsamples, were also applied to the multiple 
imputation set. 

An Example Simulation. Figure 3 plots SNP location 
against SNP-score for each method in an example simula- 
tion applied to complete data. True signal SNPs are plotted 
as black crosses and the remaining (background) SNPs as 
gray dots. In single-locus regression (Fig. 3A), SNPs are 
scored as — log 10 P (logP; see Methods). Although the true 
signals between 1 and 50 tend to attract higher scores, so do 
many of the backgrounds SNPs between 1 and 60, giving 
rise to a cloud of association that is characteristic of many 
hit regions in real GWAS. The remaining methods (3 B-D) 
report inclusion probabilities (RMIPs) for each SNP. These 
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Fig. 3. Results of four methods applied to an example case-control 
dataset from simulation study 1A. Plots show SNP score (logP or 
RMIP) against SNP location in the cancer data, with true signal 
SNPs in black and background SNPs in gray. 



describe a frequentist probability that each SNP would be 
included in a sparse model that seeks to estimate the joint 
effects of multiple SNPs. Because SNPs compete with each 
other for inclusion in these methods, the resulting scores 
more clearly differentiate those SNPs. In this example, that 
increased sparsity coincides with the set of higher scored 
loci being more enriched for true signals than is the case 
with single-locus regression. 

Results From 1,000 Simulations. Figure 4 plots 
ROC curves (see Methods) for each of the four methods, 
with single locus regression applied to complete genotype 
data and resample-based methods applied to genotype data 
with ~10% missingness that has been multiply imputed (see 
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Falsa Positive Rate False Positive Rate 

Fig. 4. Receiver operator characteristic (ROC) curves for simulation study 1A: moderate SNP effects in a hit region of moderate LD. Each 
curve is based on 1,000 simulations. Right plot shows the full ROC curve; left plot shows a zoomed section focusing on the top-scoring 
SNPs. 
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Fig. 5. Area under the ROC curve (AUC) for four methods applied to four types of imputed genotype data in simulation study 1A: 
moderate SNP effects in a hit region of moderate LD. Each AUC estimate is based on 1,000 simulations and is plotted as the mean (dot) 
with 50% CI (thick bar) and 95% CI (thin bar). 



Methods and below). The ROC curve plots the trade-off be- 
tween power (the proportion of true signals declared as 
influential) and FPR (the proportion of background SNPs 
declared as influential) when thresholding the SNP scores 
(logPs or RMIPs) at different values. The initial ROC is ar- 
guably of greater relevance to GWAS than the full ROC 
because it focuses on enrichment of true signals among the 
top-scoring SNPs. A method whose top four SNPs are true 
signals, but which never finds the fifth true signal SNP, is ar- 
guably more valuable than one whose top SNPs are false but 
which finds all five true signals among its middle scoring 
SNPs. Figure 4 shows both the full ROC curve (right) and 
the initial ROC curve (i.e., where FPR < 5%; left). Figure 5 
plots the AUC for the initial and full ROC curves for all four 
methods under four conditions: where the available geno- 
type data are complete, or have ~10% of genotypes missing 
but available in hard-, dosage- or multiply-imputed forms. 
All point estimates (plotted curves in Fig. 4 and mean AUCs 
in Fig. 5) are based on averages over the 1,000 simulations. 

Figure 4 shows, for this example, that single-locus regres- 
sion most powerfully discriminates true signals from back- 
ground when the experimenter is prepared to follow up 10% 
or more of the available SNPs. When at most the top 5% of 
SNPs can be followed up, however, single-locus regression 
is dominated by LLARRMA's permutation selection. When 
follow-up is restricted to the top ~2% of SNPs, single-locus 
regression is also dominated by complement deviance and 
oracle SS. Figure 5 echos these trends. It also shows how 
the methods perform under different forms of imputation, 



although no consistent pattern emerges favoring one form 
over the others. 



SIMULATION STUDY IB: MODERATE LD, 
SMALL EFFECTS 

We performed a second set of simulations with a design 
identical to 1A above except with smaller SNP effects (odds 
ratios around 1.25). The results in Figures 6 and 7 show that 
although some of the LLARRMA methods dominate in the 
first third of the initial ROC curve, they generally offer lit- 
tle improvement over single-locus regression under these 
conditions. The poor performance of oracle SS is striking. 
"Oracle" refers to the fact that SS was applied in a way that 
required foreknowledge of the answer: in this case, their free 
parameter X was set to maximize their initial AUC. The fact 
that LLARRMA does not have this oracle advantage and yet 
still dominates oracle SS suggests a systematic shortcoming 
of SS in this setting. Figure 8 helps explain the phenomenon. 
It plots values of the selection parameter X as estimated by 
LLARRMA and SS in a representative set of 50 of the 1,000 
simulations (using the complement deviance criterion for 
LLARRMA, which seeks to maximize out-of-sample pre- 
dictions). Each vertical series shows estimated X's from one 
simulation: gray crosses show the K — 100 distinct values 
of X (i.e., X (1) , X <2) , . . . , X (K) ) estimated for, and used for selec- 
tion in, subsamples k = 1 K by LLARRMA; black plus 

signs show the single X chosen by oracle SS to be applied 
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Fig. 6. ROC curves for simulation study IB: small SNP effects in a hit region of moderate LD. Each curve is based on 1,000 simulations. 
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Fig. 8. Choice of penalty parameter \ by oracle Stability Selection (black pluses; used to order vertical series along the x-axis) vs. 
subsample-specific choice by LLARRMA complement deviance selection (gray crosses) in 50 representative simulation trials out of 1,000 
performed for simulation study IB. 



uniformly across all K subsamples. The plot thus contrasts 
locally optimal vs. globally optimal choices of the selection 
parameter. In particular, if a dominating strategy allows 
each subsample k to have its own then this plot illus- 
trates how even the best choice of global X corresponds to a 
suboptimal local A*' for most subsamples. 



SIMULATION STUDY 2: STRONG LD, SMALL 
EFFECTS 

To examine the relative performance of the single and 
multiple locus methods in a more challenging setting, we 
simulated 700 case-control datasets based on the '58 data, 
a region on chromosome 18 containing blocks of strong LD 

Genet. Epidemiol. 



from the GWAS of WTCCC [2007] (see Methods and Fig. 2). 
Each simulated dataset had a complete set of genotypes 
and approximately balanced cases and controls. Individu- 
als' outcomes were influenced by between one and seven 
true signals of small effect (allelic odds ratios of around 
1.25), with 100 simulations devoted to each simulated num- 
ber of true signals m q — 1, ... ,7. Figure 9 shows the initial 
and full ROC curves from the 100 simulations in which five 
true signals were simulated. In this high correlation — weak 
signal setting, all forms of LLARRMA dominate single- 
locus regression in the initial ROC curve, suggesting an 
advantage of simultaneously modeling multiple loci in the 
presence of high LD. By contrast, oracle SS equals or un- 
derperforms single-locus regression, a result similar to that 
in IB above, suggesting that this modeling is suboptimal 
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Fig. 9. ROC curves for simulation study 2 with 5 true signals: small SNP effects in a hit region of strong LD. Each curve is based on 100 
simulations. 
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Fig. 10. AUC for simulation two with one to seven true signals: small SNP effects in a hit region of strong LD. Each AUC is estimated 
from 100 simulations. 



in SS. Figure 10 summarizes results from all 700 simulation 
trials and shows the effect of varying the number of true 
signals. With one true signal, single locus regression equals 
or betters any other method; but as the number of true sig- 
nals increases, its advantage over multiple locus methods 
diminishes. In particular, for four or more loci LLARRMA 
consistently outperforms in the initial AUC, whereas oracle 
SS consistently underperforms both LLARRMA and single 
locus regression. 



DISCUSSION 

We present a general approach for characterizing f requen- 
tist variability in LASSO-based model choice, LLARRMA, 
and apply it to a problem for which it should be well suited: 
discriminating true from false signals among a set of SNP 
predictors that are often highly correlated. In doing so, we 
evaluate two criteria for automatically choosing the LASSO 
penalization parameter X (permutation and complement de- 
viance selection), demonstrate potential superiority of local 
vs. global regularization in subagging (through comparison 
with SS), and propose a natural way to combine resampling 



aggregation with multiple imputation to account more com- 
prehensively for different sources of variability in model 
choice. 

LLARRMA's intended use is in focused analyses on hit 
regions that have been already identified during whole- 
genome analysis. Rather than replacing single locus regres- 
sion, its value lies in what it subsequently adds to that analy- 
sis. When there are few causal variants and mild LD, the best 
LLARRMA methods add little. But when there are many, 
applying LLARRMA produces a top set of loci that is, on 
average, enriched for true signals relative to that obtained 
by single SNP association (cf. > 4 true loci in simulations 
2B). Of our two alternatives for automatically selecting the 
penalty X., we found a slight but consistent advantage of 
permutation selection (modified from Ayers and Cordell 
[2010]). This could reflect its discovery-based motivation 
matching our discovery-oriented evaluation, and does not 
preclude complement deviance, our alternative, being su- 
perior in predictive settings. 

We explored the use of SS [Meinshausen and Buhlmann, 
2010] in this context but find it no better than, and usu- 
ally inferior to, LLARRMA, despite the fact that our eval- 
uation of SS is based on an optimal calibration of its (un- 
specified) penalization parameter. One explanation is that 
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SS's use of a single global X for all subsamples underfits 
the data, in that it fails to accommodate structural differ- 
ences between LASSO paths fit to different subsamples. 
The local automatic regularization in LLARRMA implies 
a different perspective: that \ is a parameter intrinsic to, 
and only meaningful in the context of, a single LASSO 
path on a single (subsampled) realization of the data. An- 
other factor could be our evaluation scheme: by calculating 
power and FPR at different thresholds of RMIP, we (rea- 
sonably, in our view) assume that RMIPs should be com- 
parable across simulated datasets. But when we thresh- 
old instead on the ranks of the RMIPs within simulated 
datasets (such that the best RMIP in trial s = 1 is equiv- 
alent to the best RMIP in s = 2), the performance gap 
between SS and LLARRMA narrows (data not shown), 
suggesting SS RMIPs are discriminatory but their abso- 
lute values are less comparable across studies. Lastly, al- 
though our implementation of SS uses subsample pro- 
portion 4> = 2/3 rather than the original <t» = 1/2 of Mein- 
shausen and Biihlmann [2010], our preliminary studies (not 
shown) do not suggest this biases comparisons with LLAR- 
RMA. 

Alexander and Lange [2011] recently demonstrated SS's 
inferiority to single locus regression for identifying un- 
linked quantitative trait loci (QTL) in whole-genome as- 
sociation (also using data from WTCCC [2007]). The 
weakness we identify in SS may help explain that poor 
performance. Nonetheless, we believe that to expect SS (or 
LLARRMA for that matter) to beat single locus regression 
at its own game is not only optimistic, especially given the 
near optimality of marginal approaches suggested by Fan 
and Lv [2008], but also distracts from the potential advan- 
tages of multipredictor shrinkage for disentangling highly 
correlated signals in LD blocks following an initial single 
locus scan. 

Multiple imputation is simply accommodated by our 
resampling scheme, with draws from an arbitrarily com- 
plex imputation algorithm dovetailing naturally with the 
drawing of each subsample from the full data. However, 
our results suggest that even with 10% missing geno- 
types multiple-locus inference is served just as well by 
simpler "plug-in" imputation estimates (hard and dosage). 
Nonetheless, we advocate multiple imputation where pos- 
sible because it more comprehensively models imputation 
uncertainty (among genotypes or other covariates) that 
could be more pronounced in messier datasets. 

Resample aggregation techniques, such as bootstrap ag- 
gregation ("bagging"; [Breiman, 1996]) or subsample ag- 
gregation ("subagging"; Biihlmann and Yu, 2002]), have 
been found to produce estimates of 7 that are more 
stable than from a single estimation run in the sense 
that those estimates have lower frequentist risk under 
squared error loss [Biihlmann and Yu 2002]. We prefer 
subagging (as in Meinshausen and Biihlmann [2010], Val- 
dar et al. [2009]) to bagging for two reasons. First, the- 
oretical results in Politis et al. [1999, pp. 47-51] sug- 
gest that subsampling is less efficient but more gen- 
eral than bootstrapping; specifically, that whereas boot- 
strap methods must often assume that the estimated 
statistic is at least locally smooth (which the true or 
sampled 7 is not), this assumption is not needed for 
subsampling. Second, resampling individuals with re- 
placement (bootstrapping) poorly approximates varia- 
tion in GWAS samples because it produces frequent 
duplicates in a scenario where observing multiple in- 



dividuals with identical genetic composition would be 
improbable. 

In our simulations, we measure performance by a sim- 
ple but stringent criterion. We define a "true signal" as the 
SNP that most strongly tags an underlying causal variant, 
consider all other SNPs as "background," and regard suc- 
cess as discrimination of one from the other. In doing so, we 
provide a criterion for assessment that is unambiguous and 
in line with many comparable studies [e.g., Alexander and 
Lange, 2011; Basu et al, 2011; Basu and Pan, 2011; Chun 
et al., 2011]. Nonetheless, important alternatives exist. A 
more sophisticated assessment of accuracy, for example, 
might use a softer criterion that counts as true all SNPs 
within a given distance or LD-cutoff of a causal locus [e.g., 
Ayers and Cordell, 2010; Shi et al., 2011; Zhang, 2012]. That 
more nuanced approach can be helpful in evaluating meth- 
ods for genome- wide association, but would be counterpro- 
ductive in our setting for the following reasons. We target 
hit regions already identified through genome-wide asso- 
ciation, and in those regions attempt to discriminate causal 
from correlative variants in the presence of confounding 
LD. This goal is not easily reconciled with an assessment 
mechanism that allows a margin of error in SNP choice. 
For example, the definition of a useful cutoff for declar- 
ing a true positive is highly sensitive to the marker den- 
sity and the pervasiveness of LD in the hit region of in- 
terest, and both of these vary considerably between our 
two datasets. Moreover, given a suitable cutoff, it is de- 
batable (yet crucial for assessment) whether, for example, 
one or three hits within range of two causal loci would 
count as identifying both. Our hard criterion avoids such 
ambiguities, and provides a stark but clear assessment. It 
also makes our results particularly relevant to scenarios in 
which many of the examined variants are essentially in- 
distinguishable, such as for extremely dense genotype or 
sequence data. 

We introduce LLARRMA as an approach for characteriz- 
ing model uncertainty when working within the frequen- 
tist paradigm. Alternative Bayesian variable selection ap- 
proaches do exist [e.g., Guan and Stephens, 2011; Wilson 
et al, 2010; Zhang, 2012]. At the request of a re- 
viewer, in Supporting information we assess the per- 
formance of a contemporary Bayesian variable selection 
method (PIMASS; Guan and Stephens, [2011]) within 
our simulation framework, comparing it with LLAR- 
RMA, oracle SS, single-locus regression, and forward se- 
lection. 

Although we describe LLARRMA in the case-control 
setting using the logistic model, it is easily extended 
to the analysis of quantitative traits or any response to 
which the LASSO can be applied. Similarly, although we 
model under the simplistic assumption of additive effects 
and no local epistasis, these assumptions could be re- 
laxed by a more sophisticated specification of locus effects, 
for example, using the group LASSO [Meier et al, 2008; 
Yuan and Lin, 2006] or a similar structured penalization 
scheme. 

In summary, we describe an approach for characterizing 
frequentist variability of model choice in binary data that 
can be usefully applied to the reprioritization of SNPs in hit 
regions of a case-control GWAS. The method uses LLAR- 
RMA and integrates well with schemes for imputation of 
missing data. The authors will provide an implementation 
of LLARRMA an R-package R/llarrma as soon as is practi- 
cable. 
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